Introduction
The aim of this exercise is to implement a parallel version of Conway’s Game of Life, which is a zero-player game (i.e. its evolution is determined by its initial state, requiring no further input). In our program, the initial configuration is randomly generated, and snapshots make it possible to observe how the world evolves.
The universe or the world of the Game of Life is a two-dimensional grid of square cells, and each of them interacts with its eight neighbors, which are the cells that are horizontally, vertically, or diagonally adjacent.
The grid behaves like a torus: the first row is adjacent both to the second row and to the last row, and the same holds for columns.
The rules given by the assignment are the standard ones of Conway’s Game of Life: a live cell survives if it has two or three live neighbors, otherwise it dies (by underpopulation or overpopulation); a dead cell becomes alive if it has exactly three live neighbors.
There could be different evolution types, according to what we define as “neighbors” of a cell.
In fact, if we decide that we want to evaluate the state of the world “one cell after the other”, then we need to do an ordered evolution, i.e. the evaluation of a cell depends on the already-updated state of its left neighbor and of the neighbors above it. In this scenario, we are dealing with an implicitly serial procedure.
Another option would be to evaluate the state of the world “one cell independently from the others”. This means that while in the previous case we needed to consider the update of the neighbors before evaluating a certain cell, now we can just look at the initial world state and evaluate independently each cell. At the end, we just put all these updated cells together and they compose the new state of the world. This will be referred to as static evolution.
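The static evolution described above can be sketched as a double-buffered update. This is a minimal illustration only, assuming the 0 = alive / MAXVAL = dead encoding described later for w_init.c; live_neighbours and step_static are hypothetical names, not the functions of the actual program:

```c
#include <assert.h>

#define MAXVAL 255  /* white = dead, black (0) = alive, as in w_init.c */

/* Hypothetical helper: number of live neighbours of cell (r, c)
   on a size x size torus. */
static int live_neighbours(const unsigned char *world, long size, long r, long c) {
    int alive = 0;
    for (long dr = -1; dr <= 1; dr++)
        for (long dc = -1; dc <= 1; dc++) {
            if (dr == 0 && dc == 0) continue;
            long rr = (r + dr + size) % size;   /* torus wrap by rows    */
            long cc = (c + dc + size) % size;   /* torus wrap by columns */
            alive += (world[rr * size + cc] == 0);
        }
    return alive;
}

/* One static-evolution step: every cell is evaluated against the old
   state only, and the result goes into a second buffer. */
static void step_static(const unsigned char *old, unsigned char *new_w, long size) {
    for (long r = 0; r < size; r++)
        for (long c = 0; c < size; c++) {
            int n = live_neighbours(old, size, r, c);
            /* Conway's rules: survive with 2 or 3 live neighbours,
               birth with exactly 3 */
            if (n == 3 || (n == 2 && old[r * size + c] == 0))
                new_w[r * size + c] = 0;       /* alive */
            else
                new_w[r * size + c] = MAXVAL;  /* dead  */
        }
}
```

Reading only from the old buffer is exactly what makes every cell independent of the evaluation order.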
Implementation
Main function and argument passing: main.c
In the main function we handle the parameters passed on the command line. We chose to define default values for these parameters: the default world size is 100, the evolution type is ordered, the number of iterations is 50, the snapshot recurrence is 5, the default file name is init and, by default, no action is performed.
When different parameters are passed, the default values are overwritten by the new ones. As required by the assignment, if the snapshot recurrence is 0 then the snapshot is taken only at the end of the game.
If the user asks for initialization through the -i option, the choose_initialization function is called.
If the user asks for a run through -r, it is possible to choose between the ordered evolution (-r 0), which invokes run_ordered, and the static evolution (-r 1), which invokes run_static. Both run functions take as inputs the file name, the number of iterations of the game and the snapshot recurrence.
World initialization: w_init.c
In the main function, the function choose_initialization is called. This defines a world of size size*size and of type unsigned char. We preferred chars over integers or doubles because char is the smallest type able to hold both 0 and 255, the values produced by the read_write_pgm.c functions.
Then, if the number of processes is greater than one, a parallel function is called. Otherwise, the program runs a serial version.
initialize_serial.c:
This function takes the file name, the world and the size of a square matrix as inputs. It allocates an array of unsigned chars of dimension size*size and fills it in the following way:
for(long i=0; i<size*size; i++){
    int val = rand()%100;
    if(val>70){
        world[i]=MAXVAL; //white = dead
    }else{
        world[i]=0;      //black = alive
    }
}
We chose the indexes to be long because the C standard only guarantees that an int can hold values in [−32,767, +32,767]: too small for our purposes.
Then we wrote that world on the pgm file through the write_pgm_image function and we freed the previously allocated space.
initialize_parallel.c:
This time, the function also takes as parameters the process rank and the overall number of processes: we want each process to generate its own part of the world. To do that, we defined a process_world array which contains a number of world rows determined beforehand.
To divide the workload, we used the following algorithm:
int* rcounts = (int *)malloc(pSize*sizeof(int));
int* displs  = (int *)malloc(pSize*sizeof(int));
int smaller_size;
int cumulative = 0;
if(pRank==0){
    for(int i=0; i<pSize; i++){
        smaller_size = size%pSize <= i ? size/pSize : size/pSize+1;
        rcounts[i] = smaller_size*size;
        displs[i]  = cumulative;
        cumulative = cumulative + rcounts[i];
    }
}
rcounts and displs are two arrays containing, respectively, the number of elements that each process should manage and the index of the starting element for each process.
To fill them, we establish that if the remainder of the number of lines divided by the number of processes is greater than the index of a process, then that process manages one extra line. We chose to compute these arrays on process 0 only. For example, if a matrix of 10 lines has to be divided among 4 processes, process 0 gets 3 rows, process 1 gets 3 rows, and processes 2 and 3 get 2 rows each. These values are then multiplied by the number of elements in each row, obtaining (for a square matrix) an rcounts array of 30-30-20-20 elements.
To compute displs, we just cumulatively add the number of elements assigned to each process, obtaining the index of the first element of each process: 0-30-60-80.
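The same decomposition can be sketched and checked as a standalone helper (split_rows is a hypothetical name; the logic mirrors the loop above):

```c
#include <assert.h>

/* Fill rcounts/displs for `size` rows split over `pSize` processes,
   as in the loop above: the first size%pSize processes get one extra row. */
static void split_rows(int size, int pSize, int *rcounts, int *displs) {
    int cumulative = 0;
    for (int i = 0; i < pSize; i++) {
        int rows = (size % pSize <= i) ? size / pSize : size / pSize + 1;
        rcounts[i] = rows * size;   /* elements, assuming a square matrix */
        displs[i]  = cumulative;    /* first element owned by process i   */
        cumulative += rcounts[i];
    }
}
```

For size = 10 and pSize = 4 this reproduces the 30-30-20-20 and 0-30-60-80 arrays of the example.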
These arrays were then broadcast among the processes using:
MPI_Bcast(rcounts, pSize, MPI_INT,0, MPI_COMM_WORLD);
MPI_Bcast(displs, pSize, MPI_INT, 0, MPI_COMM_WORLD);
Focusing on a single process, we generated (through OpenMP) a random seed for each MPI process and divided the workload among the threads. After this, we gathered the generated lines on process 0 using:
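A minimal sketch of this per-thread random generation follows, assuming a simple hand-rolled generator in place of the original one (fill_random and next_rand are illustrative names, and the exact seeding scheme is an assumption; the guards let the code also compile without OpenMP):

```c
#include <assert.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define MAXVAL 255

/* Minimal LCG so each thread owns its own generator state
   (the generator actually used in the original code is an assumption). */
static unsigned int next_rand(unsigned int *state) {
    *state = *state * 1103515245u + 12345u;
    return (*state >> 16) & 0x7fff;
}

/* Fill a process-local block of n cells in parallel: every MPI rank
   (and every OpenMP thread inside it) derives its own seed, so no
   two workers share generator state. */
static void fill_random(unsigned char *block, long n, int pRank) {
    #pragma omp parallel
    {
        unsigned int seed = (unsigned int)time(NULL) ^ (unsigned int)(pRank * 7919);
        #ifdef _OPENMP
        seed ^= (unsigned int)(omp_get_thread_num() * 104729);
        #endif
        #pragma omp for
        for (long i = 0; i < n; i++)
            block[i] = (next_rand(&seed) % 100 > 70) ? MAXVAL : 0; /* ~30% dead */
    }
}
```

Per-worker state avoids both lock contention on rand() and identical streams across threads.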
MPI_Gatherv(process_world, rcounts[pRank], MPI_UNSIGNED_CHAR, world, rcounts, displs, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
This returns control to the choose_initialization function and, if the process rank is 0, the assembled world is printed in a snapshot.
grow_ordered.c
This is the first evolution mode and it consists of waiting for the update of the previous cells before evaluating a new one. Due to its serial nature, we implemented it serially only.
When run_ordered is called, the world is read from the file called filename (an input parameter) and it is saved on a world array of unsigned chars.
Then a grow function is called on it, performing the number of iterations specified by the user (the times variable, another run_ordered input parameter). At each iteration, the update_world function is called and a snapshot is taken only if i%snap==0.
The update_world function selects a cell, looks at its neighbors and establishes whether the cell must live or die. Since the world grid is a torus, we needed to manage the edge cells by defining as neighbors the cells at the opposite sides of the matrix.
For example, in a 4x4 matrix, the cell at position 0 has as neighbors the cells with indexes 15, 12, 13, 3, 1, 7, 4 and 5.
This can be done using the modulus operator or branch conditions.
We used the modulus operator in the following way:
for(long i=0; i<sizex*sizey; i++){
    long left   = sizex*(i%sizex==0);
    long right  = sizex*((i+1)%sizex==0);
    long square = sizex*sizey;
    long up     = square*((i-sizex)<0);
    long down   = square*((i+sizex)>=square);
    unsigned char status=(
          matrix[i-1 + left]                 // west el
        + matrix[i+1 - right]                // east el
        + matrix[i-sizex + up]               // north el
        + matrix[i+sizex - down]             // south el
        + matrix[i-1-sizex + left + up]      // north-west el
        + matrix[i+1-sizex - right + up]     // north-east el
        + matrix[i-1+sizex + left - down]    // south-west el
        + matrix[i+1+sizex - right - down]   // south-east el
        )/MAXVAL;                            // row stride is sizex (elements per row)
}
This algorithm applies a correction to each cell that avoids branch conditions. If the element is on the “border”, then at least one of the “correction variables” left, right, up or down will be positive.
In fact, if the first element of a 4x3 matrix is considered, its neighbors will be determined by: 3 (west element: left=sizex=4), 1 (east element), 8 (north element), 4 (south element), 11 (north-west element), 9 (north-east element), 7 (south-west element) and 5 (south-east element).
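The arithmetic of this example can be verified with a small helper that applies the correction variables with a row stride of sizex (torus_neighbours is a hypothetical name used only for this check):

```c
#include <assert.h>

/* Neighbour indices of cell i in a sizex-column, sizey-row torus,
   using the branchless corrections described above. */
static void torus_neighbours(long i, long sizex, long sizey, long out[8]) {
    long square = sizex * sizey;
    long left   = sizex  * (i % sizex == 0);        /* cell on first column  */
    long right  = sizex  * ((i + 1) % sizex == 0);  /* cell on last column   */
    long up     = square * ((i - sizex) < 0);       /* cell on first row     */
    long down   = square * ((i + sizex) >= square); /* cell on last row      */
    out[0] = i - 1 + left;                  /* west       */
    out[1] = i + 1 - right;                 /* east       */
    out[2] = i - sizex + up;                /* north      */
    out[3] = i + sizex - down;              /* south      */
    out[4] = i - 1 - sizex + left + up;     /* north-west */
    out[5] = i + 1 - sizex - right + up;    /* north-east */
    out[6] = i - 1 + sizex + left - down;   /* south-west */
    out[7] = i + 1 + sizex - right - down;  /* south-east */
}
```

For the first element of a 4x3 matrix this yields exactly the indices 3, 1, 8, 4, 11, 9, 7 and 5 listed above.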
We also implemented a second, lighter algorithm to compare against the first:
for(unsigned int k=0; k<sizex*sizey; k++){
    long col = k%sizex;
    long r   = k*invsizex;            // invsizex = 1.0/sizex, precomputed to avoid an integer division
    long col_prev = col-1>=0 ? col-1 : sizex-1;
    long col_next = col+1<sizex ? col+1 : 0;
    long r_prev = r-1;                // valid thanks to the two extra rows
    long r_next = r+1;                // (see grow_static.c)
    int sum = world[r*sizex+col_prev]+      // west el
              world[r*sizex+col_next]+      // east el
              world[r_prev*sizex+col]+      // north el
              world[r_next*sizex+col]+      // south el
              world[r_prev*sizex+col_prev]+ // north-west el
              world[r_prev*sizex+col_next]+ // north-east el
              world[r_next*sizex+col_prev]+ // south-west el
              world[r_next*sizex+col_next]; // south-east el
    sum = sum/MAXVAL;
}
In both cases, status or sum contains the number of dead neighbors of the cell, as encoded in w_init.c. We tested both versions 10 times on matrices of 15.000x15.000 elements to establish which method was faster. The modulus method took a mean of 30.08583 seconds with a standard deviation of 0.1643053, while the second method took a mean of only 15.91139 seconds with a standard deviation of 0.03571708, turning out to be undoubtedly the better algorithm to evaluate the world.
grow_static.c
This is the second method we implemented to update the world, and the one that gives the most interesting possibilities with the use of MPI and OpenMP.
This program reads a given file, uses it as initial status and evolves this status for the desired number of iterations, saving a snapshot whenever iteration%snap==0 and, at the end, returns the time needed for the execution.
The main function of the algorithm is run_static, which initializes the MPI environment and all the instruments needed for the execution in the parallel case.
It reads the starting file, ensuring that the number of processes is less than or equal to the number of world rows: if not, an error message is produced.
A temporary world is created, which exceeds the real world grid by 2 extra rows: one “above” the world, one “below”. These hold a copy of the last true row (above) and of the first true row (below).
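A minimal sketch of this construction follows (add_halo_rows is a hypothetical name; the assumed layout is a (size+2) x size buffer with the real world starting at row 1):

```c
#include <assert.h>
#include <string.h>

/* Build the (size+2) x size temporary world: the top extra row holds a
   copy of the last true row, the bottom extra row a copy of the first
   one, so row-wise torus wrapping becomes a plain r-1 / r+1 access. */
static void add_halo_rows(const unsigned char *world, unsigned char *temp, long size) {
    memcpy(temp + size, world, (size_t)(size * size));      /* the real rows    */
    memcpy(temp, world + (size - 1) * size, (size_t)size);  /* top extra row    */
    memcpy(temp + (size + 1) * size, world, (size_t)size);  /* bottom extra row */
}
```

With this layout the lighter update algorithm can access r-1 and r+1 without any wrap-around check.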
At this point we needed to divide the workload among processes in exactly the same way as in initialize_parallel.c, but this time each process also receives the two additional rows.
As we said before, in a 10x10 matrix divided among 4 processes, 3 rows are given to process 0, 3 to process 1, 2 to process 2 and 2 to process 3. This time there are 2 additional rows, so the matrix is 12x10 and process 0 takes 5 rows, process 1 also 5, while the last 2 processes take 4 rows each. Obviously, these processes’ rows are partially overlapping:
[Figure: row decomposition with overlapping supporting rows]
Then a different function to grow the evolution model is called: grw_serial_static if the detected number of MPI processes is 1, grw_parallel_static otherwise.
These two functions then call the proper routine to update the world and save the output whenever needed. The update is done by using a second, auxiliary, world: at each iteration, the algorithm uses one world to read the current status, updates the status and saves the result in the other one.
To avoid wasteful copying, the updated world becomes the one to evaluate next, and the “old” one becomes the one into which the update is saved. For performance reasons, a pair of pointers determines which world is read and which is written:
unsigned char * ptr1=(i%2==0)? new_world: temp_new_world;
unsigned char * ptr2=(i%2==0)? temp_new_world: new_world;
evaluate_world(ptr1, ptr2, size, (long)scounts[pRank]*invsize, times, pRank, pSize, &status, &req);
Inside evaluate_world, the algorithm used for the single update is the second one explained in the grow_ordered.c paragraph. After the update on the second world, the first and last rows of each process must be exchanged among processes.
At each iteration, each process updates its group of rows (excluding the supporting rows), then MPI_Isend and MPI_Recv are used to exchange the boundary rows between the processes, so that each process also has its supporting rows updated.
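The rank arithmetic behind this ring exchange can be sketched without MPI calls (ring_neighbours is an illustrative helper; in the real code, each process sends its first real row to prev and its last real row to next via MPI_Isend/MPI_Recv):

```c
#include <assert.h>

/* Ring neighbours of the row decomposition: because the world wraps
   around by rows, process pRank exchanges its boundary rows with the
   process above (prev) and below (next), modulo the number of processes. */
static void ring_neighbours(int pRank, int pSize, int *prev, int *next) {
    *prev = (pRank - 1 + pSize) % pSize;  /* rank 0 wraps to pSize-1 */
    *next = (pRank + 1) % pSize;          /* rank pSize-1 wraps to 0 */
}
```

The wrap-around at ranks 0 and pSize-1 is what keeps the distributed grid a torus.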
[Figure: exchange of supporting rows between processes]
OpenMP is also used to open parallel regions for each MPI process that will further parallelize the work in the loops.
If the iteration number is divisible by snap, the program gathers all the updated process worlds and joins them into a global world that is written to a snapshot.
if(i%snap==0){
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Gatherv(&ptr2[size], rcounts_g[pRank], MPI_UNSIGNED_CHAR, world, rcounts_g, displs_g, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
    if(pRank==0){
        char * fname = (char*)malloc(60);
        sprintf(fname, "snap/snapshot_STATIC_%03d", i+1);
        write_pgm_image(world, MAXVAL, size, size, fname);
        free(fname);
    }
}
Results & discussion
Since the ordered evolution is an implicitly serial procedure, we used only the static evolution to evaluate the scalability of our program. Here we will be discussing:
OpenMP scalability: we fixed the number of MPI tasks to 1 per socket and increased the number of threads per socket from 1 to 64 (the number of cores available on the socket). We used 2, 4 and 6 sockets on a 20.000x20.000 matrix and reported the behavior.
Strong MPI scalability: we used a 25.000x25.000 matrix and increased the number of MPI tasks using 2 nodes, going from 1 MPI task up to 128 · 2 = 256.
OpenMP scalability
To test our program, we ran it on ORFEO on the EPYC nodes with the following settings:
OMP_PLACES=cores and OMP_PROC_BIND=close, in order to get the maximum computational power while keeping threads on cores as close as possible, so as to exploit data locality, with OMP_NUM_THREADS going from 1 up to 64 (the number of cores of a single socket);
matrices of size 20.000x20.000, 50 iterations for each trial, each trial repeated 5 times in order to obtain a mean result and to minimize the effect of possible outliers;
3 different runs on 2, 4, 6 sockets, mapping the MPI tasks by socket (while using 1 node) or by core with binding to socket (while using 2 or 3 nodes);

It’s clear from this graph that:
If another matrix dimension is considered, it is possible to observe that the two previous considerations are still valid, but now the speedup is higher than before.

MPI scalability
We tested how the grow_static.c functions scale when invoked on a growing number of MPI tasks. We had only 2 nodes available, so we decided to test the program with a number of MPI tasks from 1 to 256.
Some implementation notes: we wrote the run_static function such that it is process 0’s job to initialize the temporary world, containing the 2 extra rows, and to scatter the correct rows to each process.
We measured two quantities with an increasing number of MPI tasks: the overall time needed to perform the required 50 iterations of the Game of Life, and the time spent only in the MPI_Scatterv function in each repetition.
Then we computed the speedup considering both the overall time and the time excluding the MPI_Scatterv routine:

It’s very easy to see that the difference between these two lines increases with the number of MPI tasks. So, we wrote a version of the program without the MPI_Scatterv function, which clearly turned out to be a performance killer.
The cost of removing MPI_Scatterv is that now each process needs to know the entire world before evaluating its own part. This implies a larger memory usage, but we could expect better performance thanks to the removal of the message-passing overhead.
Below, a graph that compares the two versions:

To make sure that the wiggly trend observed in the no-scatter version was just an unfortunate run, we repeated the benchmark with a 20.000x20.000 matrix:

It seems that the first “gap” at #MPI tasks = 59 is common to both trials. On the other hand, the larger drop observed for the 25.000x25.000 matrix at almost 140 MPI tasks seems to be just an unfortunate measurement.
It is also clear that scalability improves with larger matrices.